Stream processor with cryptographic co-processor

ABSTRACT

A microprocessor includes a first processing core, a first cryptographic co-processor and an integer multiplier unit that is coupled to the first processing core and the first cryptographic co-processor. The first processing core includes an instruction decode unit, an instruction execution unit and a load/store unit. The first cryptographic co-processor is located on a first die with the first processing core. The first cryptographic co-processor includes a cryptographic control register, a direct memory access engine that is coupled to the load/store unit in the first processing core, and a cryptographic memory.

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority from U.S. Provisional Patent Application No. 60/345,315 filed on Oct. 22, 2001 and entitled “High Performance Web Server,” which is incorporated herein by reference in its entirety for all purposes.

BACKGROUND OF THE INVENTION

[0002] 1. Field of the Invention

[0003] The present invention relates generally to microprocessors, and more particularly, to microprocessors that include a cryptographic co-processor within the microprocessor die.

[0004] 2. Description of the Related Art

[0005] Server computers (i.e., servers) process all sorts of data transactions. One common type of data transaction is an encrypted data transaction that typically requires the server to perform at least one of an encryption function and a decryption function. FIG. 1 shows a typical server 102 and client computer 110 that are linked by a network 104, such as the Internet or other network.

[0006] FIG. 2 is a high-level block diagram of a typical server 102. As shown, the server 102 includes a processor 202, ROM 204, and RAM 206, each connected by a peripheral bus system 208. The peripheral bus system 208 may include one or more buses connected to each other through various bridges, controllers and/or adapters, such as are well known in the art. For example, the peripheral bus system 208 may include a “system bus” that is connected through an adapter to one or more expansion buses, such as a Peripheral Component Interconnect (PCI) bus. Also coupled to the peripheral bus system 208 are a mass storage device 210, a network interface 212, a number (N) of input/output (I/O) devices 216-1 through 216-N and a peripheral cryptographic processor 220.

[0007] I/O devices 216-1 through 216-N may include, for example, a keyboard, a pointing device, a display device and/or other conventional I/O devices. Mass storage device 210 may include any suitable device for storing large volumes of data, such as a magnetic disk or tape, magneto-optical (MO) storage device, or any of various types of Digital Versatile Disk (DVD) or Compact Disk (CD) based storage.

[0008] The peripheral cryptographic processor 220 (i.e., crypto-processor) is linked to the processor 202 by the peripheral bus system 208. The crypto-processor 220 performs encryption and decryption operations that may be necessary for encrypted data transactions such as between the server 102 and the client 110. In some servers the crypto-processor 220 can also be external to the server 102 and linked to the processor 202 by one of the I/O devices 216-1 through 216-N.

[0009] Network interface 212 provides data communication between the computer system and other computer systems on the network 104. Hence, network interface 212 may be any device suitable for or enabling the server 102 to communicate data with a remote processing system (e.g., client computer 110) over a data communication link, such as a conventional telephone modem, an Integrated Services Digital Network (ISDN) adapter, a Digital Subscriber Line (DSL) adapter, a cable modem, a satellite transceiver, an Ethernet adapter, or the like.

[0010] Typically the processor 202 can operate at clock speeds of up to or more than 1 GHz. Conversely, the peripheral bus system 208 typically operates at a substantially slower speed such as about 166 MHz or similar. Further, the crypto-processor 220 typically operates at a speed similar to the peripheral bus system 208. This is because the crypto-processor 220 cannot process data any faster than the data can be transported across the peripheral bus system 208. Further, the crypto-processor 220 is typically a customized, specialized processor (i.e., an application specific integrated circuit (ASIC)) that may not be made by the latest, highest performance manufacturing technologies and therefore the maximum processing speed (i.e., the crypto-processor clock speed) of the crypto-processor 220 is substantially less than the maximum processing speed of the processor 202.

[0011] FIG. 3 is a flowchart diagram of the method operations 300 of a typical encrypted data transaction within the server 102. The encrypted data transaction can be any data transaction that requires encryption, decryption or both encryption and decryption, such as an e-commerce transaction between the server 102 and the client computer 110. In operation 305, data is received in the server 102 such as from the client computer 110 or because of a request by the client computer 110.

[0012] In operation 310, the received data is analyzed to determine if the received data is encrypted. For example, the data may be encrypted because the data includes a user's personal and/or financial data or other data that is transported during an encrypted session such as SSL (secure sockets layer) or other encryption methods.

[0013] If the received data is found to not be encrypted data, in operation 310, then the received data is processed as described in operation 330 below. Alternatively, if, in operation 310, the received data is determined to be encrypted data, then, in operation 315, the encrypted data is sent to the peripheral crypto processor 220 via the peripheral bus system 208.

[0014] In operation 320, the crypto processor 220 decrypts the encrypted data. In operation 325, the crypto processor 220 outputs the decrypted data to the processor 202 via the peripheral bus system 208. In operation 330, the processor 202 processes the data to produce result data.

[0015] In operation 335, the result data is analyzed to determine if the result data should be encrypted. If the result data does not require encryption, then the processor outputs the result data to the client 110, in operation 340, and the method operations end. Alternatively, if, in operation 335, the result data requires encryption, then in operation 345, the processor outputs the result data to the crypto-processor via the peripheral bus system 208.

[0016] In operation 350, the crypto processor 220 encrypts the result data. In operation 355, the crypto processor 220 outputs the encrypted result data to the processor 202 via the peripheral bus system 208. In operation 360, the processor outputs the encrypted result data to the client 110 and the method operations end.
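By way of illustration only, the prior art method operations 300 can be sketched in software to count how many times data crosses the peripheral bus system 208 in one transaction. The function name prior_art_transaction and its two boolean parameters are hypothetical and are introduced here only for the sketch; the operation numbers are those of FIG. 3.

```python
def prior_art_transaction(request_encrypted: bool, reply_needs_encryption: bool) -> list:
    """Return the peripheral-bus transfers made during one transaction (FIG. 3).

    Hypothetical model: each list entry is one crossing of peripheral bus 208.
    """
    transfers = []
    # Operation 310: is the received data encrypted?
    if request_encrypted:
        transfers.append("315: processor 202 -> crypto processor 220 (encrypted data)")
        # Operation 320: crypto processor decrypts the data.
        transfers.append("325: crypto processor 220 -> processor 202 (decrypted data)")
    # Operation 330: processor 202 processes the data to produce result data.
    # Operation 335: does the result data require encryption?
    if reply_needs_encryption:
        transfers.append("345: processor 202 -> crypto processor 220 (result data)")
        # Operation 350: crypto processor encrypts the result data.
        transfers.append("355: crypto processor 220 -> processor 202 (encrypted result)")
    # Operation 340 or 360: result is output to the client 110.
    return transfers


if __name__ == "__main__":
    for line in prior_art_transaction(True, True):
        print(line)
```

Under these assumptions, a transaction that is encrypted in both directions moves the data across the peripheral bus system 208 four times, while an unencrypted transaction does not use the crypto processor 220 at all.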

[0017] Transferring the data to be encrypted, decrypted or processed between the crypto processor 220 and the processor 202 is very slow. Further, the slower processing speed of the crypto processor 220 also limits the rate at which the data is encrypted or decrypted. Further, if a large volume of data such as streaming data (e.g., streaming audio, streaming video, etc.) is being encrypted and/or decrypted then the rate the server 102 can serve the streaming data is limited by the rate at which the streaming data can be encrypted and/or decrypted. Further still, the multiple transfers of the streaming data between the crypto processor 220 and the processor 202 can dominate the usage of the peripheral bus system 208 and the I/O systems inside the crypto processor 220 and the processor 202, thereby limiting further the ability of the processor 202 to perform any functions other than transferring data to and from the crypto processor 220.

[0018] In view of the foregoing, there is a need for a system and method for increased and/or more efficient data encryption and decryption process speeds.

SUMMARY OF THE INVENTION

[0019] Broadly speaking, the present invention fills these needs by providing a system and method for increased and/or more efficient data encryption and decryption process speeds. It should be appreciated that the present invention can be implemented in numerous ways, including as a process, an apparatus, a system, computer readable media, or a device. Several inventive embodiments of the present invention are described below.

[0020] One embodiment includes a microprocessor that includes a first processing core, a first cryptographic co-processor and an integer multiplier unit that is coupled to the first processing core and the first cryptographic co-processor. The first processing core includes an instruction decode unit, an instruction execution unit and a load/store unit. The first cryptographic co-processor is located on a first die with the first processing core. The first cryptographic co-processor includes a cryptographic control register, a direct memory access engine that is coupled to the load/store unit in the first processing core, and a cryptographic memory.

[0021] The integer multiplier unit can be included within the first processing core or within the first cryptographic co-processor.

[0022] The cryptographic memory is at least large enough to perform a Montgomery multiplication function.

[0023] In one embodiment, the integer multiplier unit is a 64-bit×64-bit multiplier unit.

[0024] The load/store unit can be coupled to a main memory system hierarchy.

[0025] The first processing core is coupled to a second processing core by a processor crossbar. The second processing core is coupled to a second cryptographic co-processor that is located on a second die with the second processing core. Alternatively, the second processing core and the second cryptographic co-processor can be located on the first die.

[0026] The first cryptographic co-processor can be coupled to the instruction decoder unit.

[0027] The first cryptographic co-processor and the first processing core share the integer multiplier unit.

[0028] The direct memory access engine can be coupled to the load/store unit by a 64-bit data bus.

[0029] The cryptographic control register can include data that identifies a type of cryptographic instruction received in the first cryptographic co-processor.

[0030] One alternative embodiment includes a method of executing a cryptographic command. A cryptographic instruction is received in a load store unit in a processing core on a first die. The cryptographic instruction is analyzed to determine if the cryptographic instruction is a crypto store instruction. If the cryptographic instruction is a crypto store instruction, then a source operand of the crypto store instruction is stored in a crypto control register in a cryptographic co-processor on the first die. The source operand is analyzed to determine if the source operand identifies a corresponding crypto command. If the source operand identifies the corresponding crypto command, the corresponding crypto command is executed.

[0031] The cryptographic co-processor can also send an interrupt to an instruction execution unit that is included in the processing core, such as when execution of a crypto command is completed.

[0032] A result of the cryptographic instruction can also be output to a memory system using a load store unit that is included in the processing core.

[0033] Execution of the cryptographic instruction in the cryptographic co-processor can also include accessing data through the load store unit.

[0034] The cryptographic co-processor can also include a direct memory access engine. Accessing data can also include loading and storing data in a main memory.

[0035] Executing the cryptographic instruction in the cryptographic co-processor can also include executing a multiplication function. The cryptographic co-processor can include an integer multiplier unit for executing the multiplication function.

[0036] The various embodiments of the present invention provide the ability for a crypto processor to rapidly encrypt and/or decrypt data, such as streaming data, at rates much greater than possible by a prior art crypto processor.

[0037] Other aspects and advantages of the invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

[0038] The present invention will be readily understood by the following detailed description in conjunction with the accompanying drawings, and like reference numerals designate like structural elements.

[0039] FIG. 1 shows a typical server and client computer that are linked by a network, such as the Internet or other network.

[0040] FIG. 2 is a high-level block diagram of a typical server.

[0041] FIG. 3 is a flowchart diagram of the method operations of a typical encrypted data transaction within the server.

[0042] FIG. 4 shows a single CPU die (chip) in accordance with one embodiment of the present invention.

[0043] FIG. 5 shows a detailed view of the processor core and cryptographic co-processor in accordance with one embodiment of the present invention.

[0044] FIG. 6 is a flowchart 600 of the method operations of the paired processor 410 and crypto co-processor 420 according to one embodiment of the present invention.

DETAILED DESCRIPTION OF THE EXEMPLARY EMBODIMENTS

[0045] Several exemplary embodiments for a system and method for increased and/or more efficient data encryption and decryption process speeds will now be described. It will be apparent to those skilled in the art that the present invention may be practiced without some or all of the specific details set forth herein.

[0046] FIG. 4 shows a single CPU die (chip) 400 in accordance with one embodiment of the present invention. The CPU chip 400 includes a processing core 410. The processing core 410 is paired with a cryptographic co-processor 420. The cryptographic co-processor 420 is optimized to minimize the amount of hardware added to the CPU die 400. The CPU die 400 can also include additional processing cores 410 n and each of the additional processing cores 410 n is also paired with a cryptographic co-processor 420 n. The processing cores 410, 410 n are electrically coupled by a processor crossbar 430. The processor crossbar 430 is a data bus system that provides a common data communication link between the processing cores 410, 410 n and other common devices that may be accessed by the processing cores 410, 410 n such as memory systems and input/output (I/O) systems.

[0047] FIG. 5 shows a detailed view of the processor core 410 and cryptographic co-processor 420 in accordance with one embodiment of the present invention. The processor core 410 includes an instruction decode/trap handler unit 504, an instruction execution unit 506, a load store unit 508 and an integer multiplier unit 514. An instruction cache 502 is coupled to the input of the instruction decode/trap handler unit 504. A data cache 520 is coupled to the load store unit 508. The data cache 520 and the instruction cache 502 are also coupled to the processor crossbar 430. The data cache 520 can be a level-1 cache and can be between about 4 kb and about 64 kb in size.

[0048] The crypto co-processor 420 includes a crypto control register 510, a crypto memory 516, a DMA engine 518 and the crypto co-processor core 512. The crypto control register 510 stores the settings of the crypto co-processor 420. The settings can include identifying the type of encryption or decryption command or operation to be performed. The types of encryption or decryption command can include any of the encryption and decryption schemes known in the art. The crypto control register 510 also stores the status of the current crypto operations and is accessible by the processor 410 so that the processor 410 can check the status of the current crypto operations. The crypto control register 510 is linked to the load store unit 508 by a logical link 530, which represents the bi-directional interchange of data between the load store unit 508 and the crypto control register 510.

[0049] The DMA engine 518 is coupled to the load store unit 508 by a crypto bus 522. The DMA engine 518 provides more direct access to the memory system hierarchy, such as the main memory, the data cache 520 and a level-2 cache, that can be accessed via the load store unit 508 and, if necessary, the processor crossbar 430. The crypto bus 522 can be as wide as feasible to enable rapid data transfer between the crypto co-processor 420 and the processing core 410. In one embodiment, the crypto bus 522 is a 64-bit bus.

[0050] The crypto memory 516 is large enough to hold the operands and results for a particular crypto operation. By way of example, in an RSA decryption application, the crypto memory 516 is about 1.3 KB, which is large enough for a modular exponentiation on 2048-bit keys.
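As a rough check of the sizing in the example above, the following sketch works out how many 2048-bit values fit in about 1.3 KB. The text does not say which operands and results occupy the crypto memory 516, so the interpretation of KB as 1024 bytes and the use of 64-bit words (taken from the 64-bit multiplier embodiment) are assumptions made only for this arithmetic.

```python
KEY_BITS = 2048
OPERAND_BYTES = KEY_BITS // 8                 # one 2048-bit value is 256 bytes
CRYPTO_MEMORY_BYTES = int(1.3 * 1024)         # about 1.3 KB, assuming 1 KB = 1024 bytes
WORD_BYTES = 8                                # assumed 64-bit words

# How many full 2048-bit values the crypto memory can hold at once.
operands_that_fit = CRYPTO_MEMORY_BYTES // OPERAND_BYTES
# How many 64-bit words make up one 2048-bit value.
words_per_operand = OPERAND_BYTES // WORD_BYTES

if __name__ == "__main__":
    print(OPERAND_BYTES, CRYPTO_MEMORY_BYTES, operands_that_fit, words_per_operand)
```

Under these assumptions the memory holds five 2048-bit values of 32 words each, which is consistent with keeping a modulus, a base, an exponent and intermediate results on the co-processor side rather than fetching them over the crypto bus 522.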

[0051] As shown in FIG. 5, the integer multiplier unit 514 is included within the crypto processor core 512 but is directly accessible by the processor core 410 by a logical link 532. Alternatively, the integer multiplier unit 514 can be part of the processor core 410 as long as the crypto processor core 512 can directly access the integer multiplier unit 514. In this manner the crypto processor core 512 and the processor core 410 can share the integer multiplier unit 514 so as to reduce the space used (i.e., number of devices required) on the die. Typical prior art crypto processors were not included on the die 400 with the processor core because the crypto processors consumed too much valuable space on the die that was needed more for the processor core.

[0052] Sharing several components significantly reduces the “footprint” of the crypto processor so as to allow the crypto processor to be placed on the same die 400 as the processor core. The integer multiplier unit 514 performs the modular multiply and modular exponentiation functions. The processor core 410 only uses the integer multiplier unit 514 about 2-5% of the time. Therefore the crypto processor 420 can use the integer multiplier 514 about 95-98% of the time without impacting operations within the processor core 410. Moving the integer multiplier unit 514 into the crypto processor 420 further streamlines the crypto functions of the integer multiplier unit 514.

[0053] The integer multiplier unit 514 is capable of performing modular arithmetic such as Montgomery multiply functions and exponentiation. A Montgomery multiply function is a technique for performing modular multiplication on a large integer (e.g., a 2048-bit number) using two multiplications rather than a multiplication and a division. The integer multiplier unit 514 can be a 64-bit×64-bit integer multiplier unit. A 64-bit×64-bit integer multiplier unit can directly access operands that are stored in the crypto memory 516 rather than flooding the crypto bus 522 and the load store unit 508 every clock cycle. Flooding the crypto bus 522 and the load store unit 508 every clock cycle would effectively stall the processor core 410 because the load store unit 508 would only be able to address the demands of the crypto processor 420. Having a crypto memory 516 that is large enough to perform a complete modular exponentiation relieves the data throughput load on the crypto bus 522 and the load store unit 508 and thereby allows the processor core 410 and the crypto processor 420 to operate simultaneously on different operations and functions for many clock cycles.
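By way of illustration only, the Montgomery multiply function described above can be sketched in software as follows. This is not the circuit of the integer multiplier unit 514. The function names montgomery_setup, mont_mul and mod_exp, the word-serial loop and the square-and-multiply exponentiation are assumptions introduced for the sketch, which requires Python 3.8 or later for the modular inverse computed by pow. Each loop pass performs one multiplication by the operand and one multiplication by the modulus, followed by a 64-bit shift, so no division by the modulus is needed.

```python
WORD_BITS = 64                      # matches the 64-bit x 64-bit multiplier embodiment
WORD_MASK = (1 << WORD_BITS) - 1


def montgomery_setup(modulus: int, num_words: int):
    """Precompute constants for an odd modulus that fits in num_words words."""
    if modulus % 2 == 0:
        raise ValueError("Montgomery reduction requires an odd modulus")
    r = 1 << (WORD_BITS * num_words)
    # n_prime = -(modulus^-1) mod 2^64, used to cancel the low word each pass.
    n_prime = (-pow(modulus, -1, 1 << WORD_BITS)) & WORD_MASK
    r_squared = (r * r) % modulus   # converts a value into Montgomery form
    return n_prime, r_squared


def mont_mul(a: int, b: int, modulus: int, n_prime: int, num_words: int) -> int:
    """Return a * b * R^-1 mod modulus, where R = 2^(64 * num_words)."""
    t = 0
    for i in range(num_words):
        a_word = (a >> (WORD_BITS * i)) & WORD_MASK
        t += a_word * b                                # first multiplication
        m = ((t & WORD_MASK) * n_prime) & WORD_MASK    # makes the low word cancel
        t = (t + m * modulus) >> WORD_BITS             # second multiplication, then shift
    return t - modulus if t >= modulus else t


def mod_exp(base: int, exponent: int, modulus: int) -> int:
    """Square-and-multiply modular exponentiation built on mont_mul."""
    num_words = (modulus.bit_length() + WORD_BITS - 1) // WORD_BITS
    n_prime, r_squared = montgomery_setup(modulus, num_words)
    x = mont_mul(base % modulus, r_squared, modulus, n_prime, num_words)  # base in Montgomery form
    acc = mont_mul(1, r_squared, modulus, n_prime, num_words)             # 1 in Montgomery form
    for bit in bin(exponent)[2:]:
        acc = mont_mul(acc, acc, modulus, n_prime, num_words)
        if bit == "1":
            acc = mont_mul(acc, x, modulus, n_prime, num_words)
    return mont_mul(acc, 1, modulus, n_prime, num_words)                  # leave Montgomery form
```

In this sketch every intermediate value stays below twice the modulus, so a whole exponentiation can run on a small, fixed set of operand buffers, which is the property the crypto memory 516 relies on.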

[0054] FIG. 6 is a flowchart 600 of the method operations of the paired processor 410 and crypto co-processor 420 according to one embodiment of the present invention. The instruction cache 502 temporarily stores the next instruction to be executed in the processor core 410. In operation 605 the next instruction is received in the instruction decode/trap handler unit 504 from the instruction cache 502.

[0055] The received instruction is forwarded to the instruction execution unit 506 for execution in operation 610. The instruction execution unit 506 analyzes the received instruction to determine if the received instruction is a load or a store instruction in operation 615.

[0056] If, in operation 615 above, the received instruction is not a load or store instruction, the instruction is executed in the various stages 504, 506, 508 of the processor core 410 as necessary to complete execution of the non-load/non-store instruction, in operation 620, and the method operations for the executed instruction end.

[0057] If, in operation 615 above, the received instruction is a load instruction or a store instruction, the received instruction is forwarded to the load store unit 508 in operation 630.

[0058] In operation 635, the load store unit 508 analyzes the received instruction to determine if the received instruction is a crypto load instruction or a crypto store instruction.

[0059] If, in operation 635, the received instruction is not a crypto load instruction or a crypto store instruction, the non-crypto store instruction/non-crypto load instruction is executed in the load store unit 508 (i.e., the prescribed load or store function is performed) in operation 640 and the method operations for that instruction end.

[0060] If, in operation 635, the received instruction is a crypto load instruction or a crypto store instruction, the crypto load or crypto store instruction is analyzed in operation 650.

[0061] If, in operation 650, the crypto load or crypto store instruction is a crypto load instruction, the load/store unit 508 transfers the data from the crypto co-processor 420 back to the processor 410, in operation 655 and the method operations for the instruction end.

[0062] If, in operation 650, the crypto load or crypto store instruction is a crypto store instruction (i.e., not a crypto load instruction), the load/store unit 508 transfers a source operand of the crypto store instruction to the crypto control register 510 in operation 660. In operation 665, the crypto co-processor 420 analyzes the data in the crypto control register 510 to determine if the stored data identifies a corresponding crypto command. If the data does not identify a corresponding crypto command, the method operations for the instruction end.

[0063] If, in operation 665, the data identifies a corresponding crypto command, the corresponding crypto command is executed in operation 670.

[0064] When the execution of the crypto command is complete, the crypto co-processor 420 can send an interrupt to the instruction execution unit 506 in the processor core 410 via the logical link 534, in operation 670 and the method operations for the crypto command end. Some crypto commands can also cause data in the crypto memory 516 to be output to the memory system via the load store unit 508. Conversely, some crypto commands will cause data in the memory system to be loaded into the crypto memory 516 via the load store unit 508.
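By way of illustration only, the decisions of operations 615 through 670 can be sketched as a small software model. The class CryptoCoprocessor, the function dispatch, the instruction kind strings and the returned status strings are hypothetical names introduced for the sketch. In particular, returning the contents of the control register on a crypto load is an assumption, since the text states only that data is transferred from the crypto co-processor 420 back to the processor 410.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class CryptoCoprocessor:
    """Hypothetical model of crypto co-processor 420."""
    commands: Dict[int, Callable[[], None]] = field(default_factory=dict)
    control_register: int = 0       # models crypto control register 510
    interrupts_sent: int = 0        # interrupts raised over logical link 534

    def store_control(self, operand: int) -> str:
        self.control_register = operand            # operation 660
        command = self.commands.get(operand)       # operation 665
        if command is None:
            return "no_command"
        command()                                  # operation 670
        self.interrupts_sent += 1                  # completion interrupt
        return "command_executed"


def dispatch(kind: str, operand: int, coproc: CryptoCoprocessor) -> str:
    """Route one decoded instruction the way flowchart 600 describes."""
    if kind not in ("load", "store", "crypto_load", "crypto_store"):   # operation 615
        return "executed_in_core"                                      # operation 620
    # Operation 630: the instruction is forwarded to the load store unit 508.
    if kind in ("load", "store"):                                      # operation 635
        return "load_store_executed"                                   # operation 640
    if kind == "crypto_load":                                          # operation 650
        return f"crypto_load:{coproc.control_register}"                # operation 655
    return coproc.store_control(operand)                               # operations 660-670


if __name__ == "__main__":
    coproc = CryptoCoprocessor(commands={1: lambda: None})
    print(dispatch("crypto_store", 1, coproc))
```

Only a crypto store whose source operand names a known command starts work in the co-processor in this model; every other instruction completes in the processor core 410 or the load store unit 508.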

[0065] The crypto co-processor 420 can process data independently of the processor core 410 for many, even thousands of, clock cycles. Pairing the crypto processor 420 with the processing core 410 on the same die 400 increases the speed of the encryption and/or decryption processes by operating the crypto processor 420 and the processing core 410 at the same clock speed. The speed of the encryption and/or decryption processes is also increased because the data (e.g., data stream) to be encrypted and/or decrypted is not required to be transmitted the relatively long distance between the crypto processor 220 and the processor 202, at the much slower speed across the peripheral bus 208 such as described in FIG. 2 above. Further, pairing the crypto processor 420 with the processing core 410 allows the crypto processor 420 to directly access the memory system hierarchy through the crypto processor's 420 DMA engine 518 and the load store unit 508. This direct memory access allows the crypto processor 420 to rapidly encrypt and/or decrypt streaming data at rates much greater than possible by the peripheral crypto processor 220 shown in FIG. 2 above.

[0066] As used herein the term “about” means +/−10%. By way of example, the phrase “about 250” indicates a range of between 225 and 275.

[0067] With the above embodiments in mind, it should be understood that the invention might employ various computer-implemented operations involving data stored in computer systems. These operations are those requiring physical manipulation of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. Further, the manipulations performed are often referred to in terms, such as producing, identifying, determining, or comparing.

[0068] It will be further appreciated that the instructions represented by the operations in FIG. 6 are not required to be performed in the order illustrated, and that all the processing represented by the operations may not be necessary to practice the invention.

[0069] Although the foregoing invention has been described in some detail for purposes of clarity of understanding, it will be apparent that certain changes and modifications may be practiced within the scope of the appended claims. Accordingly, the present embodiments are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims.

What is claimed is:
 1. A microprocessor comprising: a first processing core including: an instruction decode unit; an instruction execution unit; a load/store unit; a first cryptographic co-processor located on a first die with the first processing core; and an integer multiplier unit that is coupled to the integer execution unit and the first cryptographic co-processor.
 2. The microprocessor of claim 1, wherein the integer multiplier unit is included within the first processing core.
 3. The microprocessor of claim 1, wherein the integer multiplier unit is included within the first cryptographic co-processor.
 4. The microprocessor of claim 1, wherein the integer multiplier unit is a 64-bit×64-bit multiplier unit.
 5. The microprocessor of claim 1, wherein the load/store unit is coupled to a main memory hierarchy.
 6. The microprocessor of claim 1, wherein the first processing core is coupled to a second processing core by a processor crossbar.
 7. The microprocessor of claim 6, wherein the second processing core is coupled to a second cryptographic co-processor that is located on a second die with the second processing core.
 8. The microprocessor of claim 6, wherein the second processing core and the second cryptographic co-processor are located on the first die.
 9. The microprocessor of claim 1, wherein the first cryptographic co-processor is coupled to the load store unit.
 10. The microprocessor of claim 1, wherein the first cryptographic co-processor and the first processing core share the integer multiplier unit.
 11. The microprocessor of claim 1, wherein the first cryptographic co-processor includes: a cryptographic control register; a direct memory access engine that is coupled to the load/store unit; a cryptographic memory; and
 12. The microprocessor of claim 11, wherein the cryptographic memory is at least large enough to perform a Montgomery multiplication function.
 13. The microprocessor of claim 11, wherein the direct memory access engine is coupled to the load/store unit by a 64-bit data bus.
 14. The microprocessor of claim 11, wherein the cryptographic control register includes data that identifies a type of cryptographic command received in the first cryptographic co-processor.
 15. A method of executing a cryptographic command comprising: receiving a cryptographic instruction in a load store unit in a processing core on a first die; determining if the cryptographic instruction is a crypto store instruction; if the cryptographic instruction is a crypto store instruction, then a source operand of the crypto store instruction is stored in a crypto control register in a cryptographic co-processor on the first die; determining if the source operand identifies a corresponding crypto command; and executing the corresponding crypto command if the source operand identifies the corresponding crypto command.
 16. The method of claim 15, further comprising sending an interrupt from the cryptographic co-processor to an instruction execution unit that is included in the processing core.
 17. The method of claim 15, further comprising outputting a result of the cryptographic instruction to a memory system using a load store unit that is included in the processing core.
 18. The method of claim 15, wherein executing the cryptographic instruction in the cryptographic co-processor includes: accessing data through the load store unit.
 19. The method of claim 18, wherein the cryptographic co-processor includes a direct memory access engine.
 20. The method of claim 19, wherein the cryptographic co-processor includes an integer multiplier unit.